release: gastown-staging -> main #3151
Code Review Summary
Status: No Issues Found | Recommendation: Merge
✅ All Previously Flagged Issues Resolved
Files Reviewed (all commits)
Reviewed by claude-sonnet-4.6 · 532,046 tokens
* chore(gastown): remove manual request logging middleware

* fix(gastown): unblock /agents/start during boot hydration; preserve mayor tools on prewarm

Three independent fixes for the startAgentInContainer timeout regression introduced by #2974, plus a tighter container-instance cap.

1. Hydration gate (control-server.ts, process-manager.ts)

The control server starts accepting requests immediately at boot, while bootHydration runs concurrently and serialises every registry agent + the mayor prewarm through the global sdkServerLock. Fresh /agents/start, /refresh-token, and PATCH /agents/:id/model requests queued behind that work and the DO-side AbortSignal.timeout(60s) fired before they ever got the lock, surfacing as "TimeoutError: aborted due to timeout" and "timeout after 6000ms: ensureSDKServer for <agentId>". A new awaitHydration() promise is awaited at the top of those handlers (before any process.env mutation in the model PATCH path) so they don't compound the queue.

2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts, process-manager.ts)

buildPrewarmEnv was constructing KILO_CONFIG_CONTENT from hardcoded defaults (anthropic/claude-sonnet-4.6 / claude-haiku-4.5), so the real /agents/start with the user's actual model triggered ensureSDKServer's "config mismatch, evicting prewarmed server" path on every warm restart, doubling lock-holding time on the critical path the prewarm was supposed to speed up. The /api/towns/:id/mayor-id endpoint now returns the full prewarm context (model, smallModel, kilocodeToken, organizationId) resolved the same way _ensureMayor resolves it, and the container builds the prewarm KILO_CONFIG_CONTENT to match. Falls back gracefully to a skip when the worker hasn't deployed the richer endpoint yet.

3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)

prewarmMayorSDK called mayorWorkdirForTown (which only returns a string) and went straight to ensureSDKServer's process.chdir, throwing ENOENT on cold containers because createMayorWorkspace only ran from runAgent. Exported ensureMayorWorkspaceForTown so prewarm materialises the workspace first. More critically, buildPrewarmEnv was missing GASTOWN_AGENT_ROLE, GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID: env vars the kilo serve plugin (plugin/index.ts) reads at spawn to decide whether to register mayor tools. Without them the prewarmed server booted with NO mayor tools, and the cache hit on the next /agents/start handed that defective instance back to the user. Now mirrors the mayor-shaped subset of buildAgentEnv. Added an end-to-end test that intercepts createKilo and asserts the env at spawn time.

4. wrangler.jsonc: lower TownContainerDO max_instances from 800 to 500.

Verified with pnpm --filter gastown-container test (67/67 pass), pnpm --filter cloudflare-gastown typecheck, oxlint, and pnpm format.

* feat(gastown): per-route logger tagging via Hono params (review on #3158)

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
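The plugin-env fix in item 3 can be pictured with a small sketch. Only the env key names and the `GASTOWN_AGENT_ROLE === 'mayor'` check come from this PR; the `MayorPrewarmContext` shape, the helper names, and the JSON layout of `KILO_CONFIG_CONTENT` are hypothetical stand-ins, not the actual implementation.

```typescript
// Hypothetical context shape; the real getMayorPrewarmContext() result may differ.
interface MayorPrewarmContext {
  agentId: string;
  townId: string;
  model: string;
  smallModel: string;
}

type Env = Record<string, string>;

// Identity keys that a mayor spawn needs, whether it is a real start or a prewarm.
function mayorIdentityEnv(ctx: MayorPrewarmContext): Env {
  return {
    GASTOWN_AGENT_ID: ctx.agentId,
    GASTOWN_AGENT_ROLE: 'mayor',
    GASTOWN_TOWN_ID: ctx.townId,
    KILOCODE_FEATURE: 'gastown',
  };
}

// Sketch of a prewarm env that mirrors the mayor-shaped subset of the start env.
// The config JSON layout here is an assumption made for illustration only.
function buildPrewarmEnvSketch(ctx: MayorPrewarmContext): Env {
  return {
    ...mayorIdentityEnv(ctx),
    KILO_CONFIG_CONTENT: JSON.stringify({ model: ctx.model, smallModel: ctx.smallModel }),
  };
}

// Stand-in for the plugin's spawn-time decision described in the commit message.
function registersMayorTools(spawnEnv: Env): boolean {
  return spawnEnv['GASTOWN_AGENT_ROLE'] === 'mayor';
}
```

Because the spawned server snapshots its environment once, an env built without the identity keys would leave `registersMayorTools` false for the lifetime of the cached instance, which is the defect described above.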
…st procedure

Adds three dev-only debug endpoints for autonomous convoy testing without going through the mayor LLM:

- GET /debug/towns/:townId/rigs: list rigs in a town
- POST /debug/towns/:townId/sling-convoy: call Town.slingConvoy() directly
- GET /debug/towns/:townId/convoys: list active convoys with progress

Documents the new endpoints and adds a Test C section to e2e-pr-feedback-testing.md with a deterministic procedure for verifying review-then-land convoys end-to-end (sub-bead PRs into the convoy feature branch, then a landing PR into main).

Also captures known issues observed during verification: container MTU/TLS handshake failures with github.com, 'failed' blockers not gating dependents, and intermittent polecat skipping of sub-PR creation.
… stale stored value

resolveGitHubToken previously preferred git_auth.github_token over the platform integration. Since GitHub App installation tokens have a 1h TTL but git_auth.github_token is only written at rig creation (or rare manual refresh), every long-lived town with an integration was handing out an expired token to:

- Polecat/refinery 'gh' CLI (via GH_TOKEN derived from GIT_TOKEN in the container), surfacing as 'Failed to log in to github.com using token (GH_TOKEN). The token in GH_TOKEN is invalid.'
- The worker-side PR poller (checkPRStatus, checkPRFeedback, mergePR, areThreadsBlocking): 401 from api.github.com.
- The /refresh-git-token endpoint the container falls back to on auth failure: it returned the same expired token, so the retry just re-failed.

Verified by hitting api.github.com with a local town's stored token: 401 even though the integration service mints fresh ones fine.

Fix:

- Flip resolveGitHubToken's priority to github_cli_pat -> live integration -> stored github_token (last-resort fallback for towns with no integration). Empty-string responses from the integration service now warn and fall back instead of silently failing.
- Resolve a fresh token at agent dispatch (startAgentInContainer), merge dispatch (startMergeInContainer), and rig setup (setupRigRepoInContainer) before stuffing GIT_TOKEN into envVars.
- buildContainerConfig now resolves a fresh token before serializing git_auth.github_token into the X-Town-Config header. The container's syncTownConfigToProcessEnv path reads this on every request to update process.env.GIT_TOKEN, which buildLiveHotSwapEnv then derives GH_TOKEN from on token-refresh hot-swaps. townId is required (not optional) so a forgotten arg can't silently regress to the stale-token shape.
- syncConfigToContainer resolves a fresh token before persisting GIT_TOKEN to DO storage for next boot.

Adds 6 unit tests covering the priority chain (cli_pat preferred, fresh integration over stale stored, fallback on lookup failure, rig-level integration ID, no-config returns null).
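The priority chain described in this commit can be sketched as a pure function. This is a minimal illustration, assuming the integration lookup has already happened and its result is passed in; `TokenSources` and `pickGitHubToken` are hypothetical names, and the real resolveGitHubToken is not reproduced here.

```typescript
// Hypothetical input shape: each field is one candidate source, already looked up.
interface TokenSources {
  cliPat?: string; // github_cli_pat
  integrationToken?: string | null; // what the integration service returned, if it was asked
  storedToken?: string; // git_auth.github_token written at rig creation
}

// Priority: github_cli_pat -> live integration -> stored github_token.
// An empty string from the integration service is reported and treated as a miss.
function pickGitHubToken(
  sources: TokenSources,
  warn: (message: string) => void = () => {},
): string | null {
  if (sources.cliPat) return sources.cliPat;
  if (sources.integrationToken) return sources.integrationToken;
  if (sources.integrationToken === '') {
    warn('integration service returned an empty token; falling back to stored token');
  }
  return sources.storedToken ? sources.storedToken : null;
}
```

Keeping the stored token last means a town with a working integration never hands out the value written at rig creation, while towns without an integration still get their only available token.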
…3160)

* fix(gastown): distinguish null causes in PR status polling (#3149)

Replace PRStatusResult | null return type with discriminated PRStatusOutcome union in checkPRStatus. Each null cause (no token, HTTP error, invalid response, unrecognized URL, host mismatch) now surfaces a structured PRStatusError with actionable failure messages.

- resolveGitHubToken returns GitHubTokenResolution with resolution chain
- no_token and non-transient HTTP errors (401/403/404) fail immediately
- invalid_response/unrecognized_url/host_mismatch fail after 3 strikes
- Transient HTTP errors (5xx/429) keep existing 10-strike behavior
- poll_null_count resets to 0 on successful poll at both call sites
- failureKind persisted to bead metadata for analytics
- AE event pr.poll_failed emitted on terminal failure
- Unit tests for checkPRStatus, resolveGitHubToken, failureMessageFor, and threshold logic
- Integration test for no_token immediate-fail path

* style: apply oxfmt formatting

* fix(gastown): track integration source when GIT_TOKEN_SERVICE unbound (review on town-scm.ts:66)

When integrationId is set but GIT_TOKEN_SERVICE binding is missing, the configured integration source was silently omitted from the tried array. Add an else branch that pushes the source label with a '(GIT_TOKEN_SERVICE not bound)' annotation so the no_token error message lists all attempted sources.

* fix(gastown): fail immediately for unrecognized_url and host_mismatch (review on actions.ts:374)

Both are deterministic configuration errors that cannot self-resolve on retry. Move them from the 3-strike bucket to the fail-immediately bucket alongside no_token and non-transient http_error. Only invalid_response remains in the 3-strike category.

* fix(gastown): use separate counters for transient vs non-transient poll errors (review on actions.ts:1350)

Replace the shared poll_null_count with poll_transient_count and poll_non_transient_count. Each error category increments only its own counter and resets the other, preventing cross-contamination where 9 transient errors followed by 1 non-transient error would incorrectly fail the bead.

Legacy poll_null_count is migrated on first read: the transient branch falls back to poll_null_count when poll_transient_count is absent. This ensures in-flight beads at deploy time retain their existing counter value. The non-transient branch does not read the legacy field since these counters reset on every success anyway; at worst an in-flight bead gets one extra retry for invalid_response.

* fix(gastown): resolve merge conflict in resolveGitHubToken

- merge staging priority with PR #3160 structured return type
- resolveGitHubToken now uses staging's priority: cli_pat → integration → stored token
- Returns GitHubTokenResolution discriminated union (from PR #3160)
- Includes unbound-service else branch (GIT_TOKEN_SERVICE not bound)
- Adds resolveGitHubTokenString helper for non-error-aware callers
- Updates Town.do.ts, container-dispatch.ts, config.ts to use helper
- Updates town-scm.test.ts for GitHubTokenResolution return shape
- Updates pr-poll-errors.test.ts for new priority order

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
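The end state of the threshold logic after these commits can be sketched as follows. The counter names, failure kinds, and the 3/10 strike limits come from the commit messages; the type names, `recordPollFailure`, the exact comparison (`>=`), and treating every non-transient HTTP status as immediate are assumptions made for illustration.

```typescript
type PollFailureKind =
  | 'no_token'
  | 'http_error'
  | 'invalid_response'
  | 'unrecognized_url'
  | 'host_mismatch';

interface PollError {
  kind: PollFailureKind;
  status?: number; // only meaningful for http_error
}

interface PollCounters {
  poll_transient_count?: number;
  poll_non_transient_count?: number;
  poll_null_count?: number; // legacy shared counter, read only for migration
}

const TRANSIENT_STRIKES = 10;
const NON_TRANSIENT_STRIKES = 3;

// 5xx and 429 are the transient HTTP errors named in the commit message.
function isTransient(e: PollError): boolean {
  return e.kind === 'http_error' && e.status !== undefined && (e.status >= 500 || e.status === 429);
}

// Everything except invalid_response and transient HTTP errors fails on the first strike.
function failsImmediately(e: PollError): boolean {
  return e.kind !== 'invalid_response' && !isTransient(e);
}

// Each category bumps only its own counter and resets the other.
function recordPollFailure(c: PollCounters, e: PollError): { counters: PollCounters; failBead: boolean } {
  if (failsImmediately(e)) return { counters: c, failBead: true };
  if (isTransient(e)) {
    // Legacy migration: fall back to poll_null_count when the new counter is absent.
    const n = (c.poll_transient_count ?? c.poll_null_count ?? 0) + 1;
    return {
      counters: { poll_transient_count: n, poll_non_transient_count: 0 },
      failBead: n >= TRANSIENT_STRIKES,
    };
  }
  const n = (c.poll_non_transient_count ?? 0) + 1;
  return {
    counters: { poll_transient_count: 0, poll_non_transient_count: n },
    failBead: n >= NON_TRANSIENT_STRIKES,
  };
}
```

With separate counters, nine transient errors followed by one invalid_response leave the bead alive, which is the cross-contamination case the last review fix targets.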
… awaitHydration in /refresh-token

The /refresh-token handler assigned process.env.GASTOWN_CONTAINER_TOKEN before awaiting hydration, inconsistent with PATCH /agents/:id/model which gates first. Mid-hydration token refresh could cause buildPrewarmEnv to pick up a different token than the one hydration captured locally.
…stead of module global

The _resolveHydration module-global stale-capture pattern would orphan the first promise's resolver if bootHydration() were ever called concurrently. Capturing resolve as a local inside bootHydration() itself eliminates the risk and removes the module-global.
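A minimal sketch of the gate with a locally captured resolver, assuming a simplified bootHydration that takes its work as a callback (the real function's signature is not shown in this PR text):

```typescript
// Default-resolved so contexts that never run hydration are not blocked.
let hydration: Promise<void> = Promise.resolve();

// Handlers await this before touching shared state.
function awaitHydration(): Promise<void> {
  return hydration;
}

// Hypothetical signature: `work` stands in for the real hydration steps.
async function bootHydration(work: () => Promise<void>): Promise<void> {
  // The resolver lives in this call's scope, so a second concurrent call
  // cannot overwrite it and strand the first promise.
  let resolve!: () => void;
  hydration = new Promise<void>((r) => {
    resolve = r;
  });
  try {
    await work();
  } finally {
    // Release waiters even if hydration throws.
    resolve();
  }
}
```

A handler would start with `await awaitHydration()` and only then mutate `process.env` or take the SDK lock, which matches the ordering fix described for /refresh-token.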
… getMayorPrewarmContext
getMayorPrewarmContext now returns { agentId } even when the kilocode
token is unavailable (instead of null), so the worker route no longer
needs to fall through to getMayorAgentId. This eliminates the redundant
agents.listAgents SQL query over a second RPC hop.
…igId routes

The per-route tagging middleware registered prefixes under /api/orgs/:orgId/... but missed the parallel /api/users/:userId/rigs/:rigId family. Without this, requests to those routes lack rigId in structured log tags.
Review observation dispositions
Observation A: "Request/response logging removed without replacement"
Intentional: PR #3158 deletes those manual log lines because …
Observation C: "Double…
Summary
Promotes 5 commits from gastown-staging to main. Three independent fix groups plus a developer-facing test procedure:
- Unblocks /agents/start during container boot hydration and preserves mayor tools on prewarm.

Also lowers TownContainerDO.max_instances from 800 → 500 (as part of commit 1).

Constituent commits
1. Boot hydration + mayor prewarm fix (2ffcef28f, direct push)
Three independent fixes for the startAgentInContainer timeout regression observed after #2974, plus a tighter container-instance cap.

Symptoms. Production logs were filling with two error patterns since the last gastown-staging → main promotion:
- "TimeoutError: aborted due to timeout"
- "timeout after 6000ms: ensureSDKServer for <agentId>"

Root cause. The control server starts accepting requests immediately at boot (main.ts:83), while bootHydration() runs concurrently and serialises every registry agent + the new mayor prewarm through the global sdkServerLock (createKilo reads process.cwd() / process.env). Fresh /agents/start, /refresh-token, and PATCH /agents/:id/model requests queued behind that work and the DO-side AbortSignal.timeout(60s) (resp. REFRESH_AGENT_TIMEOUT_MS=6_000) fired before they ever got the lock.

The mayor prewarm added in #3122 made things worse on two axes:
- buildPrewarmEnv built KILO_CONFIG_CONTENT from hardcoded model defaults, so the real /agents/start with the user's actual model triggered ensureSDKServer's "config mismatch, evicting prewarmed server" path on every warm restart, doubling lock-holding time on the critical path the prewarm was supposed to speed up.
- buildPrewarmEnv omitted GASTOWN_AGENT_ROLE, GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID from the prewarm env. kilo serve snapshots process.env at spawn, and plugin/index.ts:66 keys mayor-tool registration off GASTOWN_AGENT_ROLE === 'mayor'. Without those, the prewarmed server booted with no mayor tools, and the cache hit on the next /agents/start handed that defective instance back to the user, manifesting as "mayor tools became unavailable."

Changes
1. Hydration gate (control-server.ts, process-manager.ts)
New awaitHydration() exported from process-manager.ts: a promise that bootHydration replaces on entry and resolves in a finally. Awaited at the top of /agents/start, /refresh-token, and PATCH /agents/:id/model (before any process.env mutation in the model PATCH path so concurrent requests can't race on env writes before holding the SDK lock). Default-resolved at module init so test/dev contexts that never run hydration aren't blocked.

2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts, process-manager.ts)
New getMayorPrewarmContext() on TownDO returns { agentId, model, smallModel, kilocodeToken, organizationId } resolved the same way _ensureMayor resolves them (config.resolveModel(townConfig, null, 'mayor')). The /api/towns/:townId/mayor-id endpoint now returns that whole context so the container builds a KILO_CONFIG_CONTENT byte-identical to what the next /agents/start will send. Falls back to the bare { agentId } shape for back-compat; the container skips prewarm when model/token aren't available rather than building a config that's guaranteed to mismatch.

3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)
- Exported ensureMayorWorkspaceForTown(townId) so prewarmMayorSDK materialises the workspace before ensureSDKServer's process.chdir (was throwing ENOENT on cold containers).
- buildPrewarmEnv now mirrors the mayor-shaped subset of buildAgentEnv: GASTOWN_AGENT_ID, GASTOWN_AGENT_ROLE='mayor', GASTOWN_TOWN_ID, KILOCODE_FEATURE='gastown', KILO_TEST_HOME, XDG_DATA_HOME. New end-to-end test intercepts createKilo and asserts those keys are visible to the spawn.

4. wrangler.jsonc
Lowered TownContainerDO.max_instances from 800 → 500 (manual change).

2. Remove manual request logging middleware (#3158, a6cf1029b)
Removes the redundant request-logging middleware in gastown.worker.ts that logged every request twice (--> / <-- via logger.info), which is already covered by the per-route instrumented(c, route, handler) AE event wrapper. Replaces the regex-based logger.setTags block with proper per-route tagging using Hono c.req.param() matching for :orgId / :townId / :rigId / :agentId prefixes. Net diff: ~30 deletions + ~25 additions.
Link: #3158
3. Convoy debug endpoints + E2E test procedure (7f9121ffa, direct push)
Adds three dev-only debug endpoints for autonomous convoy testing without going through the mayor LLM:
- GET /debug/towns/:townId/rigs: list rigs in a town
- POST /debug/towns/:townId/sling-convoy: call Town.slingConvoy() directly
- GET /debug/towns/:townId/convoys: list active convoys with progress

Documents the new endpoints and adds a Test C section to services/gastown/docs/e2e-pr-feedback-testing.md with a deterministic procedure for verifying review-then-land convoys end-to-end (sub-bead PRs into the convoy feature branch, then a landing PR into main). Captures known issues observed during verification: container MTU/TLS handshake failures with github.com, 'failed' blockers not gating dependents, and intermittent polecat skipping of sub-PR creation.

4. Fresh integration tokens for GitHub auth (ce15a6fe7, direct push)
resolveGitHubToken previously preferred git_auth.github_token over the platform integration. Since GitHub App installation tokens have a 1h TTL but git_auth.github_token is only written at rig creation (or rare manual refresh), every long-lived town with an integration was handing out an expired token to:
- Polecat/refinery gh CLI (via GH_TOKEN derived from GIT_TOKEN in the container), surfacing as "Failed to log in to github.com using token (GH_TOKEN). The token in GH_TOKEN is invalid."
- The worker-side PR poller (checkPRStatus, checkPRFeedback, mergePR, areThreadsBlocking): 401 from api.github.com.
- The /refresh-git-token endpoint the container falls back to on auth failure: it returned the same expired token, so the retry just re-failed.

Fix flips priority to github_cli_pat → live integration → stored github_token (last-resort fallback for towns with no integration). Empty-string responses from the integration service now warn and fall back instead of silently failing. Resolves a fresh token at agent dispatch (startAgentInContainer), merge dispatch (startMergeInContainer), and rig setup (setupRigRepoInContainer) before stuffing GIT_TOKEN into envVars. buildContainerConfig now resolves a fresh token before serializing git_auth.github_token into the X-Town-Config header. Adds 6 unit tests covering the priority chain.

5. Distinguish null causes in PR status polling (#3160, 63873e425)
Fixes #3149.
Replace PRStatusResult | null return type with discriminated PRStatusOutcome union in checkPRStatus. Each null cause (no token, HTTP error, invalid response, unrecognized URL, host mismatch) now surfaces a structured PRStatusError with actionable failure messages.

Key changes:
- resolveGitHubToken returns GitHubTokenResolution with resolution chain tracking which sources were tried (back-compat helper resolveGitHubTokenString exists for non-error-aware callers).
- no_token, unrecognized_url, host_mismatch, and non-transient HTTP errors (401/403/404) fail the bead immediately (1 strike).
- invalid_response fails after 3 strikes.
- poll_transient_count and poll_non_transient_count are separate counters (replacing the cross-contaminated single poll_null_count); both reset on successful poll.
- failureKind persisted to bead metadata for analytics.
- AE event pr.poll_failed emitted on terminal failure.
- resolveGitHubToken tracks the configured integration source even when GIT_TOKEN_SERVICE binding is missing.
Link: #3160
Verification
- … bootHydration is in flight, release when it returns).
- … bootHydration with a /mayor-id fetch mock, intercepts createKilo, asserts GASTOWN_AGENT_ID, GASTOWN_AGENT_ROLE='mayor', GASTOWN_TOWN_ID, GASTOWN_CONTAINER_TOKEN, and a non-empty KILO_CONFIG_CONTENT are all visible at spawn time).
- … _ensureMayor model-resolution path to confirm resolveModel(townConfig, null, 'mayor') is byte-identical to what /agents/start will send (mayor role ignores rigOverride entirely in config.resolveModel).
- … mayor.ensure_decision: short_circuit_warm and agent.startup_phase after merge.
- pnpm --filter cloudflare-gastown typecheck passes.
- … test/unit/pr-poll-errors.test.ts (checkPRStatus, resolveGitHubToken), test/unit/pr-poll-thresholds.test.ts (failureMessageFor, shouldFailImmediately, shouldCountAsTransient).
- … test/integration/pr-poll-errors.test.ts.

Visual Changes
N/A
Reviewer Notes
- /api/towns/:townId/mayor-id response shape is back-compat: the container's Zod schema (MayorPrewarmResponse) accepts both the new full-context shape and the legacy { agentId } shape with .passthrough(), and rolls back to "skip prewarm" on missing fields.
- organizationId fallback chain in buildPrewarmEnv distinguishes undefined (older worker, fall back to process.env) from null (worker authoritatively says "no org") so a stale env-var value can't override an authoritative null.
- bootHydration is currently single-call from main.ts. If we ever add periodic re-hydration, the resolver capture should move to a local inside bootHydration (called out in code review as a SUGGESTION, deferred).
- … prewarmMayorSDK warns but doesn't bail on workdir-mismatch (cheap to harden later), (b) one negative-case timing assertion in the new test relies on a 10ms setTimeout (test still validates the positive case deterministically).
- refresh-git-token.handler.ts change is a caller update for the new GitHubTokenResolution return type (was string | null).
- wrangler.jsonc max_instances change (800→500) is from the boot hydration commit (2ffcef28f).
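
The undefined-versus-null distinction in the organizationId note can be sketched as a small pure function. `resolveOrganizationId` and its parameters are hypothetical names; the only behaviour taken from the note is that an absent field falls back to the environment while an explicit null is authoritative.

```typescript
// fromWorker: value of organizationId in the worker response.
//   undefined -> older worker that does not send the field
//   null      -> worker authoritatively says "no org"
// fromEnv: whatever the container already had in its environment, if anything.
function resolveOrganizationId(
  fromWorker: string | null | undefined,
  fromEnv: string | undefined,
): string | null {
  if (fromWorker === undefined) {
    // Field absent: the environment is the best information available.
    return fromEnv ?? null;
  }
  // Field present: trust it, including an explicit null.
  return fromWorker;
}
```

Checking `=== undefined` rather than using `fromWorker ?? fromEnv` is the point of the design: nullish coalescing would let a stale environment value override the worker's explicit null.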